New visualization tools for numeric distributional data tables

From pre-processing to interpretation

Antonio Irpino, Ph.D.

Dept. of Mathematics and Physics
University of Campania L. Vanvitelli
Caserta, Italy

Thursday, the 9th of November, 2023

Layout

1) Aggregate and distributional data

Distributions are the numbers of the future.
Schweizer (1984)

2) Visualizing a table of (1D) distributions

3) Visualizing a single row (through eye iris or flowers)

The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey

4) Visualizing large distributional data tables (extending a heatmap)

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.
John Tukey

5) An application on Chile climatic data

Aggregate and distributional data: numeric distributional data

Let’s see an example: BLOOD dataset from the HistDAWass R package.

It is a classic dataset in the Symbolic Data Analysis community, describing

  • 14 typologies of patients;
  • 3 distributional variables;

obtained by aggregating a set of raw data from a hospital. See Billard and Diday (2006).

name       | Cholesterol (V1) bins | p1    | Hemoglobin (V2) bins | p2    | Hematocrit (V3) bins | p3
-----------+-----------------------+-------+----------------------+-------+----------------------+------
u1: F-20   | [80 ; 100]            | 0.025 | [12 ; 12.9]          | 0.050 | [35 ; 37.5]          | 0.025
           | [100 ; 120]           | 0.075 | [12.9 ; 13.2]        | 0.112 | [37.5 ; 39]          | 0.075
           | [120 ; 135]           | 0.175 | [13.2 ; 13.5]        | 0.212 | [39 ; 40.5]          | 0.188
           | [135 ; 150]           | 0.250 | [13.5 ; 13.8]        | 0.201 | [40.5 ; 42]          | 0.387
           | [150 ; 165]           | 0.200 | [13.8 ; 14.1]        | 0.188 | [42 ; 45.5]          | 0.287
           | [165 ; 180]           | 0.162 | [14.1 ; 14.4]        | 0.137 | [45.5 ; 47]          | 0.038
           | [180 ; 200]           | 0.088 | [14.4 ; 14.7]        | 0.075 |                      |
           | [200 ; 240]           | 0.025 | [14.7 ; 15]          | 0.025 |                      |
-----------+-----------------------+-------+----------------------+-------+----------------------+------
u2: F-30   | [80 ; 100]            | 0.013 | [10.5 ; 11]          | 0.007 | [31 ; 33]            | 0.046
           | [100 ; 120]           | 0.088 | [11 ; 11.3]          | 0.039 | [33 ; 35]            | 0.171
           | [120 ; 135]           | 0.154 | [11.3 ; 11.6]        | 0.082 | [35 ; 36.5]          | 0.295
           | [135 ; 150]           | 0.253 | [11.6 ; 11.9]        | 0.174 | [36.5 ; 38]          | 0.243
           | [150 ; 165]           | 0.210 | [11.9 ; 12.2]        | 0.216 | [38 ; 39.5]          | 0.170
           | [165 ; 180]           | 0.177 | [12.2 ; 12.5]        | 0.266 | [39.5 ; 41]          | 0.072
           | [180 ; 195]           | 0.066 | [12.5 ; 12.8]        | 0.157 | [41 ; 44]            | 0.003
           | [195 ; 210]           | 0.026 | [12.8 ; 14]          | 0.059 |                      |
           | [210 ; 240]           | 0.013 |                      |       |                      |
-----------+-----------------------+-------+----------------------+-------+----------------------+------
u14: M-80+ | [155 ; 170]           | 0.067 | [10.8 ; 11.2]        | 0.133 | [33.5 ; 35.5]        | 0.133
           | [170 ; 185]           | 0.133 | [11.2 ; 11.6]        | 0.067 | [35.5 ; 37.5]        | 0.267
           | [185 ; 200]           | 0.200 | [11.6 ; 12]          | 0.134 | [37.5 ; 39.5]        | 0.267
           | [200 ; 215]           | 0.267 | [12 ; 12.4]          | 0.333 | [39.5 ; 41.5]        | 0.133
           | [215 ; 230]           | 0.200 | [12.4 ; 12.8]        | 0.200 | [41.5 ; 43]          | 0.200
           | [230 ; 245]           | 0.067 | [12.8 ; 13.2]        | 0.133 |                      |
           | [245 ; 260]           | 0.066 |                      |       |                      |

The first two and the last typology of patient in the BLOOD dataset.

Numerical distributional dataset

A distributional dataset is a classical table with \(N\) observations on the rows and \(P\) variables on the columns, such that the generic term \(y_{ij}\) is a numerical univariate distribution

\[y_{ij}\sim f_{ij}(x_j)\] where \(x_j\in D_j \subset \mathbb{R}\) and \(f_{ij}(x_j)\geq 0\),

  • \(\int\limits_{x_j \in D_j}f_{ij}(x_j)dx_{j}=1\), if the distribution has a continuous support;
  • \(\sum\limits_{x_j\in D_j}{ f_{ij}(x_j)}=1\), if the distribution has a discrete support.
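As a minimal sketch of this definition, one cell \(y_{ij}\) of the table can be stored as a list of bins with their masses. The values below are the Cholesterol histogram of unit u1 from the BLOOD table; the function names and the piecewise-constant density inside each bin are illustrative assumptions, not the HistDAWass API.

```python
from math import isclose

# Cholesterol histogram of unit u1 (F-20), copied from the BLOOD table:
# each entry is (lower bound, upper bound, probability mass of the bin).
u1_cholesterol = [
    (80, 100, 0.025), (100, 120, 0.075), (120, 135, 0.175), (135, 150, 0.250),
    (150, 165, 0.200), (165, 180, 0.162), (180, 200, 0.088), (200, 240, 0.025),
]


def is_valid_histogram(bins, tol=1e-6):
    """Check non-negative masses, proper contiguous bins, and total mass 1."""
    masses_ok = all(p >= 0 for _, _, p in bins)
    bounds_ok = all(a < b for a, b, _ in bins)
    contiguous = all(bins[k][1] == bins[k + 1][0] for k in range(len(bins) - 1))
    total_ok = isclose(sum(p for _, _, p in bins), 1.0, abs_tol=tol)
    return masses_ok and bounds_ok and contiguous and total_ok


def density(bins, x):
    """Density f(x), assuming the mass is spread uniformly inside each bin."""
    for a, b, p in bins:
        if a <= x < b:
            return p / (b - a)
    return 0.0  # outside the support


print(is_valid_histogram(u1_cholesterol))
print(density(u1_cholesterol, 140))  # mass 0.250 over a bin of width 15
```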

The BLOOD dataset

The basic plot for the \(i\)-th observation

The \(i\)-th observation is the vector \(y_i=[y_{i1},\ldots,y_{ij},\dots,y_{iP}]\)

Steps:

  1. Domain discretization

    • For continuous variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, fixing an integer \(K_j\), we partition \(D_j\) into \(K_j\) equi-width intervals (bins) of values, such that: \[D_j=\left\{ B_{jk}=(a_k,b_k] \;\middle|\; b_k>a_k,\ k=1,\ldots,K_j,\ \bigcup_{k=1}^{K_j}B_{jk}=[\min(D_j),\max(D_j)],\ B_{jk}\cap B_{jk'}=\emptyset \text{ for } k\neq k' \right\} \] where the first bin is taken closed on the left, \(B_{j1}=[a_1,b_1]\), so that \(\min(D_j)\) is included.
    • For discrete variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, with \(\# D_j=K_j\) the cardinality of \(D_j\), we use the elements of \(D_j\) as categories.
  2. Choice of a diverging colour palette. We consider a diverging colour palette with \(K_j\) levels, such that level \(1\) represents the lowest bin/category and level \(K_j\) the highest one.

  3. Stacked percentage barcharts. We compute the mass observed in each bin/category for each \(y_{ij}\).
    For the observation \(y_i\), \(P\) bars are generated. The order of the bars can be chosen according to the user's preferences, or can be suggested by a preliminary correlation analysis of all the data (for example, one may cluster the distributional variables using a hierarchical clustering based on the Wasserstein correlation and then use the order of the variables returned by the aggregation).

  4. Polar coordinates. Polar coordinates allow the stacked barcharts to be represented as circles that mimic the iris of an eye.
    We call this plot the Eye Iris plot (EI plot).
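The polar step can be sketched as a pure geometry computation. This is an illustration under stated assumptions, not the author's implementation: each variable gets an equal angular sector, the cumulative mass is mapped linearly onto the radius between a hypothetical `pupil_radius` and 1, and the function name `iris_segments` is invented for the example.

```python
from math import pi


def iris_segments(masses_per_variable, pupil_radius=0.3):
    """Return one annular segment per non-empty bin.

    masses_per_variable: list of P lists; each inner list holds the bin
    masses of one variable (ordered from lowest to highest bin).
    Each segment is (theta_start, theta_end, r_inner, r_outer, level).
    """
    segments = []
    n_vars = len(masses_per_variable)
    for j, masses in enumerate(masses_per_variable):
        # Equal angular sector for the j-th variable (assumption).
        theta_start = 2 * pi * j / n_vars
        theta_end = 2 * pi * (j + 1) / n_vars
        cumulative = 0.0
        for level, mass in enumerate(masses):
            r_inner = pupil_radius + (1 - pupil_radius) * cumulative
            cumulative += mass
            r_outer = pupil_radius + (1 - pupil_radius) * cumulative
            if mass > 0:  # empty bins produce no visible band
                segments.append((theta_start, theta_end, r_inner, r_outer, level))
    return segments


def median_circle_radius(pupil_radius=0.3):
    """Radius of the dashed circle marking cumulative mass 0.5."""
    return pupil_radius + (1 - pupil_radius) * 0.5


# Two toy variables: the first with two bins, the second with a single bin.
for segment in iris_segments([[0.5, 0.5], [1.0]], pupil_radius=0.2):
    print(segment)
```

The level index of each segment selects the colour from the diverging palette; the pupil keeps the innermost bands from collapsing to a point.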

Example using BLOOD data

The extremes of the domains of the variables

  • Range of Cholesterol [ 80 ; 270 ]

  • Range of Hemoglobin [ 10.2 ; 15 ]

  • Range of Hematocrit [ 30 ; 47 ]

Choice of \(K\) and of a color palette

We fix \(K=50\) and we will use a color palette from Red (low values), passing through Yellow (middle values) to Green (high values).

Now, let’s take the first observation

Recode each distribution according to the \(K=50\) partition of its domain.

Since the bins represent classes of values, we can consider them as ranked levels of the domain.
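A minimal sketch of the recoding step, using the Cholesterol histogram of the first observation, the range [80 ; 270] and \(K=50\). The function name `rebin` is hypothetical, and the mass of each original bin is assumed to be spread uniformly over its width when it is split across the new bins.

```python
def rebin(bins, lo, hi, k):
    """Redistribute histogram masses onto k equal-width bins over [lo, hi].

    bins: list of (lower, upper, mass). Assumes uniform mass inside each bin.
    Returns a list of k masses, one per ranked level of the domain.
    """
    width = (hi - lo) / k
    out = [0.0] * k
    for a, b, p in bins:
        for idx in range(k):
            left = lo + idx * width
            right = left + width
            overlap = min(b, right) - max(a, left)
            if overlap > 0:
                # Share of the original bin falling inside the new bin.
                out[idx] += p * overlap / (b - a)
    return out


# Cholesterol histogram of u1 (F-20) from the BLOOD table.
u1_cholesterol = [
    (80, 100, 0.025), (100, 120, 0.075), (120, 135, 0.175), (135, 150, 0.250),
    (150, 165, 0.200), (165, 180, 0.162), (180, 200, 0.088), (200, 240, 0.025),
]

levels = rebin(u1_cholesterol, lo=80, hi=270, k=50)
print(len(levels), round(sum(levels), 6))
```

Levels above 240 receive no mass for this observation, so the corresponding colours do not appear in its bar.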

We propose to show all three distributions using a stacked percentage barchart, as follows. Note that each colour level has an area proportional to the mass associated with the corresponding bin.

The dashed line is drawn at level \(0.5\): the colour it crosses in each bar identifies the bin of the respective domain that contains the median of the distribution.
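Reading the median off the bar amounts to finding the first level at which the stacked (cumulative) mass reaches 0.5. A small sketch, with a hypothetical function name and toy masses:

```python
def level_at_quantile(masses, q=0.5):
    """Index of the first bin whose cumulative mass reaches q.

    masses: bin masses ordered from the lowest to the highest level.
    With q = 0.5 this is the bin crossed by the dashed median line.
    """
    cumulative = 0.0
    for level, mass in enumerate(masses):
        cumulative += mass
        if cumulative >= q:
            return level
    return len(masses) - 1  # guard against rounding when masses sum to < q


toy_masses = [0.1, 0.2, 0.3, 0.4]  # illustrative values, not BLOOD data
print(level_at_quantile(toy_masses))
```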

However, this kind of visualization is not well suited to comparing several observations at a glance. Let’s see an example:

For this reason, we propose a plot based on polar coordinates, adding a pupil to reduce the distortion due to the polar transformation, as follows:

Since humans readily perceive the shapes and colours of eyes, we believe that this kind of visualization can be more interpretable. For example, let’s see all 14 observations together.

Interpretation

The fill colours allow us to compare both observations and distributional values.

The Enriched plot

We propose to add information about dispersion and skewness.

The dispersion

Each variable in the dataset may have a different dispersion. The dispersion of each distribution \(y_{ij}\) is measured by its standard deviation \(\sigma_{ij}\). We normalize each \(\sigma_{ij}\) by the maximum standard deviation observed for the \(j\)-th variable, \(\max_i(\sigma_{ij})\) with \(i=1,\ldots,N\). A segment, centered in the respective sector, conveys the dispersion associated with each distribution.
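The normalization can be sketched as follows. The standard deviation of a histogram is computed here under the assumption that mass is uniform inside each bin; the function names are hypothetical and the toy column is illustrative, not BLOOD data.

```python
from math import sqrt


def hist_mean(bins):
    """Mean of a histogram given as (lower, upper, mass) triples."""
    return sum(p * (a + b) / 2 for a, b, p in bins)


def hist_sd(bins):
    """Standard deviation, assuming uniform mass inside each bin."""
    mean = hist_mean(bins)
    # E[X^2] of a uniform on [a, b] is (a^2 + a*b + b^2) / 3.
    second_moment = sum(p * (a * a + a * b + b * b) / 3 for a, b, p in bins)
    return sqrt(max(second_moment - mean * mean, 0.0))


def normalized_sds(column):
    """Divide each sigma_ij of one variable j by the largest one in the column."""
    sds = [hist_sd(h) for h in column]
    largest = max(sds)
    return [s / largest if largest > 0 else 0.0 for s in sds]


# One toy variable observed on two units: a wide and a narrow histogram.
toy_column = [[(0, 12, 1.0)], [(0, 6, 1.0)]]
print(normalized_sds(toy_column))
```

The normalized value lies in \([0, 1]\) and can be used directly as the relative length of the segment drawn in the sector.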

The skewness

Each \(y_{ij}\) has its skewness value computed via the third standardized moment \(\gamma_{ij}\).

We draw the skewness segment of \(y_{ij}\) outside the dashed circle if the skewness is positive, and inside it if the skewness is negative. The distance from the dashed circle represents the absolute value of the skewness index. If the segment is very close to the dashed circle, the distribution is almost symmetric.
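A sketch of the skewness computation and of the side on which the segment is drawn. It assumes uniform mass inside each bin; the function names and the toy histograms are hypothetical.

```python
def raw_moment(bins, r):
    """E[X^r] of a histogram given as (lower, upper, mass) triples.

    For a uniform on [a, b], E[X^r] = (b^(r+1) - a^(r+1)) / ((r+1) (b - a)).
    """
    return sum(
        p * (b ** (r + 1) - a ** (r + 1)) / ((r + 1) * (b - a)) for a, b, p in bins
    )


def skewness(bins):
    """Third standardized moment (gamma) of the histogram."""
    m1, m2, m3 = (raw_moment(bins, r) for r in (1, 2, 3))
    variance = m2 - m1 ** 2
    if variance <= 0:
        return 0.0
    third_central = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    return third_central / variance ** 1.5


def segment_side(gamma):
    """Where the skewness segment sits relative to the dashed circle."""
    if gamma > 0:
        return "outside"
    if gamma < 0:
        return "inside"
    return "on the circle"


right_tailed = [(0, 1, 0.9), (1, 10, 0.1)]  # toy histogram with a long right tail
print(segment_side(skewness(right_tailed)))
```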

An example applied to Hierarchical clustering

An example in a PCA

Visualizing all the data table

A distributional heatmap

The EI plots can be useful for medium-sized data tables: fewer than 20 units or variables.

In any case, in these slides we present an alternative representation of the BLOOD dataset, adapting the classical heatmap plot to distributional data.

In distributional heatmaps each tile contains a visualization of a row-column distribution.

The novelty

We replace each single-colour tile with a horizontal stacked percentage bar, constructed after binning each variable’s domain and using a diverging colour palette, as for the EI plot.
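The tile construction can be sketched as a list of rectangles. This is only an illustration: the unit-square tile layout, the function name `tile_segments` and the toy masses are assumptions.

```python
def tile_segments(row, col, masses):
    """Rectangles of the stacked bar filling the heatmap tile at (row, col).

    masses: bin masses of the distribution, ordered from lowest to highest bin.
    Each rectangle is (x, y, width, height, level); the tile is a unit square
    whose lower-left corner is at (col, row).
    """
    rectangles = []
    cumulative = 0.0
    for level, mass in enumerate(masses):
        if mass > 0:
            rectangles.append((col + cumulative, row, mass, 1.0, level))
        cumulative += mass
    return rectangles


# Tile in row 2, column 3 with an empty middle bin.
print(tile_segments(2, 3, [0.25, 0.0, 0.75]))
```

Each rectangle is then filled with the palette colour of its level, so a tile dominated by one colour signals a concentrated distribution.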

A comparison between

Classical visualization

Distributional heatmap (new!!)

Advantages of heatmaps

Clustering rows

Clustering columns

Advantages of heatmaps, clustering both

An application to Chile climatic data (1970-2000)

The data

We downloaded the data from https://www.worldclim.org/data/worldclim21.html

The website contains the WorldClim version 2.1 climate data for 1970-2000.

Data was released in January 2020.

There are monthly climate data for minimum, mean, and maximum temperature, precipitation, solar radiation, wind speed, water vapor pressure, and for total precipitation.

There are also 19 “bioclimatic” variables.

The data is available at four spatial resolutions, from 30 seconds (~1 km²) to 10 minutes (~340 km²).

Each download is a “zip” file containing 12 GeoTiff (.tif) files, one for each month of the year (January is 1; December is 12).

We considered the 30-second resolution (~1 km²).
Chile is covered by about 3 million grid cells. For each of the 56 provinces we compute the distribution of the values over the spatial units belonging to it (not the distribution of the time series values, which are not available).
A distributional data table is constructed with 56 rows (the provinces) and 48 distributional variables: 12 monthly distributions for each of average temperature, (log) precipitation, wind speed, and solar radiation.
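The aggregation from grid cells to provinces can be sketched as below. The province labels and values are toy data, not WorldClim values, and the function name and the choice of equal-width bins over the global range are assumptions made for the example.

```python
from collections import defaultdict


def province_histograms(cells, k):
    """Aggregate (province, value) grid cells into one histogram per province.

    Uses k equal-width bins over the global range of the values and returns
    {province: list of k relative frequencies}.
    """
    values = [value for _, value in cells]
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values coincide
    counts = defaultdict(lambda: [0] * k)
    for province, value in cells:
        idx = min(int((value - lo) / width), k - 1)  # the maximum goes in the last bin
        counts[province][idx] += 1
    return {
        province: [c / sum(bin_counts) for c in bin_counts]
        for province, bin_counts in counts.items()
    }


# Toy grid cells for one variable (e.g. one month of one climate measure).
toy_cells = [("A", 0.0), ("A", 1.0), ("A", 9.0), ("B", 10.0)]
print(province_histograms(toy_cells, k=2))
```

Repeating this for each of the 48 variables gives one row of 48 histograms per province.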

The classical (useless) visualization

Let’s see the distributional heatmap

Let’s see the seven clusters

Mapping the results

In the map, you can click on a province and a tooltip shows its EI plot.

Conclusion

  • We proposed two new plots for distributional data visual analysis
    • Eye Iris plot
      • Pros. It is possible to compare several distributions for each observation.
        Colour shading reveals some of the dispersion structure.
        It allows several distributional observations to be compared.
      • Cons. The number of distributions should not exceed 40-50.
        The ordering of sectors (variables) is subjective (but some criteria can be chosen).
    • Distributional heatmap
      • Pros. It shows many observations and distributional variables at the same time and makes clustering structures or correlations between variables perceptible.
      • Cons. The more variables there are, the harder it is to perceive the internal dispersion of the distributional data.
  • Future improvements:
    • Enrich the plots so that some features of the distributions are more evident (multimodality, skewness, etc.).
    • In clustering: after computing and plotting the cluster barycenters, we can enrich the central part of the EI plot with statistics related to the dispersion of each distributional variable.
    • A Flower plot, as an alternative to the EI plot, is in preparation (see an example in the next slide).

A latest proposal: the Flowers of distribution plot

Flower plots are a polar version of enriched violin plots. They should provide a quicker interpretation than EI plots.

EI plot

Flower plot

References

Billard, L., and E. Diday. 2006. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
Brito, P., and S. Dias. 2022. Analysis of Distributional Data. Chapman & Hall/CRC. https://doi.org/10.1201/9781315370545.
Irpino, A. 2018. HistDAWass: Histogram Data Analysis Using Wasserstein Distance.
Irpino, A., and E. Romano. 2007. “Optimal Histogram Representation of Large Data Sets: Fisher Vs Piecewise Linear Approximation.” Revue Des Nouvelles Technologies de l’Information RNTI-E-9: 99–110.
Irpino, A., and R. Verde. 2006. “A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data.” In Data Science and Classification, edited by V. Batagelj et al., 185–92. Berlin: Springer.
———. 2015. “Basic Statistics for Distributional Symbolic Variables: A New Metric-Based Approach.” Advances in Data Analysis and Classification 9 (2): 143–75.
Verde, R., and A. Irpino. 2018. “Multiple Factor Analysis of Distributional Data.” Statistica Applicata, Italian Journal of Applied Statistics.

Thank you!